Abstract

The problem of controlling a finite state Markov chain in the pres- 
ence of an adversary so as to ensure desired performance levels for a 
vector of objectives is cast in the framework of Blackwell approacha- 
bility. Relying on an elementary two time scale construction a control 
scheme is proposed which ensures almost sure convergence to the de- 
sired set regardless of the adversarial actions. 

Key words: controlled Markov chains, Blackwell approachability, two time 
scales, stationary strategies, multi-objective optimization 

1 Introduction 

Many control problems in practice have two features that put them outside of 
the classical framework of deterministic or stochastic optimal control theory: 
presence of unknown disturbances and multiple objectives. One common 
approach for addressing the former issue is to treat the disturbances as ac- 
tions of an adversary and plan against the worst case scenario thereof. This 
makes the problem a two person zero sum game. While the classical two per-
son zero sum stochastic games are fully analyzable through the associated
Shapley equation, this is not the case when there are many objectives. In a
seminal article, Blackwell [1] provided a framework for addressing this 'vector
minmax' problem in case of repeated games, providing both the necessary
and sufficient conditions for attainability of the objectives (what came to be
known as Blackwell approachability) and a scheme for achieving the same. 
This is becoming a popular model for addressing engineering problems with 
aforementioned features, see, e.g., Hou et al [2] for a recent application. 
The framework has also found application in strategic learning literature in 
economics and computer science, see, e.g., Young [5]. As observed above, 
many engineering situations call for going beyond the repeated game model 
and consider a controlled Markov dynamics instead. In an important work, 
Shimkin and Shwartz [4] studied this problem for controlled Markov chains
and proposed a scheme to ensure Blackwell approachability. Their scheme 
depends on updating strategies at return times to a fixed state, which allows 
them to exploit the regenerative nature of such visits. This is necessitated 
by the fact that there appears to be a need to hold the policy fixed for some 
time - the interval between two return times in their case - for the 'learning' 
to take place. For a large chain, the return times can be infrequent, rendering 
the convergence slower. Motivated by this, we propose an alternative scheme 
here that holds a policy constant for durations that are short initially and 
can become longer gradually, thus capturing the 'exploration-exploitation' 
trade-off. Each choice of strategy is associated with a positive re-scaled 
time duration and whenever the player switches to a new strategy he retains 
it for the associated re-scaled duration of time. Almost sure convergence of 
the running average cost to the desired set is then established under standard 
conditions. A major ingredient in our proof is an elementary two time scale 
argument and the proposed scheme is designed to exploit the two time scale 
feature in an essential way. 

The paper is organized as follows. Section 2 describes the problem set-up
and introduces the notation and some preliminary concepts. Section 3
develops an elementary two time scale result which plays a crucial role in
the proof of convergence and around which our scheme is built in the first
place. Section 4 proves the main convergence result, Theorem 18. Section 5
concludes by outlining some further possibilities.

2 Basic setup 

The model. Consider a system evolving as a controlled Markov chain on 
a finite state space S with a reward associated with each transition. We 
assume that the reward is always some vector from a compact set $K \subset \mathbb{R}^d$.
Let $U^p$ and $U^a$ be finite action spaces. Let $(\theta_n)$ denote the aforementioned
controlled Markov chain on $S$ with transition kernel $p(\theta' \mid \theta, u^p, u^a)$ for
$\theta', \theta \in S$, $u^p \in U^p$, $u^a \in U^a$. Let $\mathcal{P}(U^p)$ denote the set of probability
distributions on the space $U^p$. Let $\Pi^p$ denote the set of all maps, or strategies,
from $S$ to $\mathcal{P}(U^p)$. Similarly, let $\Pi^a$ denote the set of all strategies from $S$ to
$\mathcal{P}(U^a)$. Depending on the past, the player and the adversary independently choose
their current strategies from $\Pi^p$ and $\Pi^a$ respectively. Let $(u^p_n)$, $(u^a_n)$ be the
actual control sequences chosen by the player and the adversary from $U^p$, $U^a$
respectively. At time step $n$ the one step reward is given by $\kappa(\theta_n, u^p_n, u^a_n)$.
Let $x_n$ denote the vector of current average reward. The iterative equation
for the average reward becomes

\[ x_{n+1} = x_n + \frac{1}{n+1}\left[\kappa(\theta_n, u^p_n, u^a_n) - x_n\right]. \]

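As a small illustration of the recursion for the average reward, the following sketch applies the update with step size 1/(n + 1) to a short list of reward vectors. The function names and the sample rewards are hypothetical and only serve to show that the iterate coincides with the plain arithmetic mean of the rewards seen so far.

```python
def update_average(x, reward, n):
    """One step x_{n+1} = x_n + (reward_n - x_n) / (n + 1), componentwise."""
    return [xi + (ri - xi) / (n + 1) for xi, ri in zip(x, reward)]


def running_average(rewards):
    """Apply the update to every reward in order, starting from x_0 = 0."""
    x = [0.0] * len(rewards[0])
    for n, r in enumerate(rewards):
        x = update_average(x, r, n)
    return x


# Hypothetical two-dimensional rewards; the result equals their mean.
rewards = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
avg = running_average(rewards)
```

The starting value x_0 plays no role, since the first update uses step size 1 and overwrites it.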
Main goal. The aim of the main player is to have the average reward
asymptotically approach a certain desirable subset $D \subset K$ $(\subset \mathbb{R}^d)$ by
suitably choosing his strategy at each step. More precisely, the player seeks
to choose his sequence of strategies in such a manner that no matter what
sequence of strategies the adversary chooses, with probability one all limit
points of the sequence $(x_n)$ lie in $\bar{D}$, where $\bar{D}$ denotes the closure of $D$.

Assumptions. In our analysis we restrict our attention to the case where
$D$ is convex. However, see Section 5 for a possible extension to the case of
non-convex $D$. Next, assume that when the strategies for the main player
and the adversary are held fixed at arbitrary strategies $\pi^p \in \Pi^p$ and $\pi^a \in \Pi^a$
respectively, the Markov chain $(\theta_n)$ is ergodic. Let $\eta^{(\pi^p, \pi^a)}(\cdot)$ denote the
corresponding stationary measure on the state space $S$ with the strategies for
the player and the adversary held fixed. Define the corresponding average
reward $\kappa(\pi^p, \pi^a)$ as

\[ \kappa(\pi^p, \pi^a) := \sum_{\theta \in S} \sum_{u^p, u^a} \kappa(\theta, u^p, u^a)\, \eta^{(\pi^p, \pi^a)}(\theta)\, \pi^p(u^p \mid \theta)\, \pi^a(u^a \mid \theta). \]

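The stationary average reward for a fixed pair of strategies can be computed numerically. The sketch below is only an illustration under stated assumptions: the induced chain is taken to be aperiodic so that plain power iteration converges, the nested-list layout `p[s][up][ua][s2]` and `kappa[s][up][ua]` is a hypothetical encoding of the kernel and the reward, and the toy numbers are made up.

```python
def induced_kernel(p, pi_p, pi_a):
    """Transition matrix of the chain when both strategies are held fixed."""
    n = len(p)
    return [[sum(pi_p[s][up] * pi_a[s][ua] * p[s][up][ua][s2]
                 for up in range(len(pi_p[s]))
                 for ua in range(len(pi_a[s])))
             for s2 in range(n)] for s in range(n)]


def stationary(P, iters=2000):
    """Power iteration eta <- eta P (assumes an ergodic, aperiodic chain)."""
    n = len(P)
    eta = [1.0 / n] * n
    for _ in range(iters):
        eta = [sum(eta[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return eta


def average_reward(p, kappa, pi_p, pi_a):
    """Sum of kappa(s, up, ua) * eta(s) * pi_p(up | s) * pi_a(ua | s)."""
    eta = stationary(induced_kernel(p, pi_p, pi_a))
    d = len(kappa[0][0][0])
    out = [0.0] * d
    for s, w in enumerate(eta):
        for up, a in enumerate(pi_p[s]):
            for ua, b in enumerate(pi_a[s]):
                for k in range(d):
                    out[k] += w * a * b * kappa[s][up][ua][k]
    return out


# Toy model: 2 states, 2 player actions, 1 adversary action.
half = [0.5, 0.5]
p = [[[half], [half]], [[half], [half]]]
kappa = [[[[float(s), float(up)]] for up in range(2)] for s in range(2)]
pi_p = [[0.25, 0.75], [0.25, 0.75]]
pi_a = [[1.0], [1.0]]
avg = average_reward(p, kappa, pi_p, pi_a)
```

In this toy model the stationary measure is uniform, so the first reward coordinate averages to 0.5 and the second to the probability 0.75 of the player's second action.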
For any point $x$, let $x_D$ be the (unique) point in $D$ closest to $x$. For the
rest of this paper we work under the following assumption, which is standard
for Blackwell approachability:

Assumption 1. For every $x \in K \setminus D$ there exists a player strategy $\pi^p_x$
satisfying the following inequality:

\[ \inf_{\pi^a \in \Pi^a} \langle \kappa(\pi^p_x, \pi^a) - x_D,\; x_D - x \rangle > 0. \]

In words, the hyperplane through $x_D$ perpendicular to the line segment $x x_D$
separates $x$ from the set $\{\kappa(\pi^p_x, \pi^a) : \pi^a \in \Pi^a\}$.

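For $\rho \in \mathbb{R}_+$, let $B(x, \rho)$ denote the open ball of radius $\rho$ centered at $x$.

Lemma 1. There exists a map $\rho(\cdot) : K \setminus D \to \mathbb{R}_+$ such that for any
$x \in K \setminus D$, we have

\[ \inf_{y \in B(x, \rho(x))} \; \inf_{\pi^a \in \Pi^a} \langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle > 0. \tag{1} \]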



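Proof. For any $x \in K \setminus D$, by Assumption 1 there exists a player strategy $\pi^p_x$
and an $\epsilon > 0$ such that

\[ \inf_{\pi^a \in \Pi^a} \langle \kappa(\pi^p_x, \pi^a) - x_D,\; x_D - x \rangle > \epsilon. \]

We get

\begin{align*}
\langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle
&= \langle \kappa(\pi^p_x, \pi^a) - x_D + (x_D - y_D),\; y_D - y \rangle \\
&= \langle \kappa(\pi^p_x, \pi^a) - x_D,\; x_D - x \rangle + \langle \kappa(\pi^p_x, \pi^a) - x_D,\; (y_D - y) - (x_D - x) \rangle \\
&\quad + \langle x_D - y_D,\; y_D - y \rangle \\
&> \epsilon - |\langle \kappa(\pi^p_x, \pi^a) - x_D,\; (y_D - y) - (x_D - x) \rangle| - |\langle x_D - y_D,\; y_D - y \rangle|.
\end{align*}

Since $\sup_{\pi^a \in \Pi^a} \sup_{x \in K} \|\kappa(\pi^p_x, \pi^a) - x_D\| < \infty$ and $\sup_{y \in K} \|y_D - y\| < \infty$,
it follows that there exists a finite positive constant $c$ such that

\[ \langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle > \epsilon - c\left(\|x_D - y_D\| + \|x - y\|\right). \]

Since $D$ is convex, the map $x \mapsto x_D$ must be continuous. It follows that
there exists a $\rho(x) > 0$ such that $\langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle > \epsilon/2$ whenever
$\|x - y\| < \rho(x)$. Since this holds for any $\pi^a$, we get

\[ \inf_{y \in B(x, \rho(x))} \; \inf_{\pi^a \in \Pi^a} \langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle > 0. \]

□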

For the rest of the paper we assume that $\rho(\cdot) : K \setminus D \to \mathbb{R}_+$ is a function
satisfying (1). We now introduce the main objects needed for our analysis.
The sets $K_n$, $Q_n$ and $Q$. For $n \in \mathbb{N}$, define compact sets $K_n$ as

\[ K_n := \left\{ y \in K : \inf_{x \in D} \|y - x\| \in [1/(n+1),\, 1/n] \right\}. \]

We can write

\[ K \setminus D = \bigcup_{n \in \mathbb{N}} K_n. \]

For $n \in \mathbb{N}$, the collection $\{B(x, \rho(x)/2) : x \in K_n\}$ is an open cover for $K_n$.
By compactness there exists a finite subcover. Let $Q_n$ be a finite subset of
$K_n$ such that

\[ \bigcup_{q \in Q_n} B(q, \rho(q)/2) \supset K_n. \]

Let $Q$ denote the union

\[ Q := \bigcup_{n \in \mathbb{N}} Q_n. \]

The following result is immediate.

Proposition 2. The collection $Q$ is a countable collection.

The map $Q(\cdot)$. Since $Q$ is countable, we can assign an injective (one-one)
map $I : Q \to \mathbb{N}$. Using the map $I(\cdot)$ we define a map $Q : K \setminus D \to Q$ where,
for $x \in K \setminus D$, we define

\[ Q(x) := \operatorname*{argmin}_{q} \{ I(q) : x \in B(q, \rho(q)/2),\; q \in Q \}. \]

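The selection rule for the map Q(·) can be sketched as follows. This is a minimal illustration, assuming that only finitely many candidate centers are listed and that their position in the list plays the role of the injective index map; the names `centers` and `radii` and the sample numbers are hypothetical.

```python
import math


def select_center(x, centers, radii):
    """Index of the first listed center q with |x - q| < rho(q) / 2, or None."""
    for idx, (q, rho) in enumerate(zip(centers, radii)):
        if math.dist(x, q) < rho / 2:
            return idx
    return None


# Hypothetical centers with their radii rho(q).
centers = [(0.0, 0.0), (1.0, 0.0)]
radii = [1.0, 3.0]
first = select_center((0.2, 0.0), centers, radii)   # covered by both half-balls
second = select_center((0.9, 0.0), centers, radii)  # covered only by the second
```

When a point lies in several half-radius balls, the center with the smallest index is chosen, which makes the choice unambiguous.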
The re-scaled times and the interpolated trajectory. Let $t(0) = 0$. For
$n \in \mathbb{N}$, define the re-scaled times

\[ t(n) = \sum_{i=1}^{n} \frac{1}{i}. \]



Let $x(\cdot)$ be the trajectory obtained by linearly interpolating between the
iterates. Thus, for any $n \in \mathbb{N}$ and $t \in [t(n), t(n+1))$ define

\[ x(t) := \frac{t(n+1) - t}{t(n+1) - t(n)} \cdot x_n + \frac{t - t(n)}{t(n+1) - t(n)} \cdot x_{n+1}. \]
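The re-scaled times and the interpolated trajectory can be sketched numerically. The snippet below is an illustration only: it takes t(n) to be the n-th harmonic sum with t(0) = 0, treats the iterates as scalars for simplicity, and uses hypothetical function names.

```python
import bisect


def rescaled_times(n_max):
    """t[0] = 0 and t[n] = 1 + 1/2 + ... + 1/n for n = 1, ..., n_max."""
    t = [0.0]
    for i in range(1, n_max + 1):
        t.append(t[-1] + 1.0 / i)
    return t


def interpolate(xs, t, s):
    """Piecewise-linear x(s) through the points (t[n], xs[n])."""
    n = bisect.bisect_right(t, s) - 1          # index with t[n] <= s < t[n+1]
    n = min(max(n, 0), len(xs) - 2)            # clamp to a valid segment
    lam = (s - t[n]) / (t[n + 1] - t[n])
    return (1 - lam) * xs[n] + lam * xs[n + 1]


t = rescaled_times(3)            # [0, 1, 1.5, 1.8333...]
xs = [0.0, 2.0, 4.0, 10.0]       # hypothetical scalar iterates x_0, ..., x_3
mid_first = interpolate(xs, t, 0.5)
mid_second = interpolate(xs, t, 1.25)
```

Because the n-th step has re-scaled length 1/(n + 1), the same as the step size of the averaging recursion, the slope of the interpolated trajectory on each segment is the one step reward minus the current average.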
The map $T(\cdot)$. Define $v_{\max} := \sup_{x \in K} \sup_{\theta \in S,\, u^p \in U^p,\, u^a \in U^a} \|\kappa(\theta, u^p, u^a) - x\|$. Since
$x, \kappa(\theta, u^p, u^a) \in K$ and $K$ is compact, it follows that $v_{\max} < \infty$. Clearly, for
times $u_1$ and $u_2$,

\[ \|x(u_1) - x(u_2)\| \le v_{\max} |u_1 - u_2|. \]

Let $T : Q \to \mathbb{R}_+$ be a map such that for every $q \in Q$ the following holds:

M < T(q) < (2) 



Choice of strategy along $S$. We are now ready to define how the player
should choose his strategies over time. Let $\pi^p_0$ be any arbitrary strategy.
Let $S := (s_n)$ denote the increasing subsequence of times when the player
changes his strategy. Start with $s_0 = 0$. Assume $s_n$ is known. We consider
two cases, $x_{s_n} \in K \setminus D$ and $x_{s_n} \in D$. If $x_{s_n} \in K \setminus D$ then set $q = Q(x_{s_n})$. Now
choose the strategy $\pi^p_q$ and set

\[ s_{n+1} := \min \left\{ m : \sum_{i=s_n}^{m-1} \frac{1}{i+1} > T(q) \right\}, \]

that is, $s_{n+1}$ is the first time $m$ at which $t(m) - t(s_n)$ exceeds $T(q)$.
If, however, $x_{s_n} \in D$ then choose the strategy $\pi^p_0$ and set $s_{n+1} = s_n + 1$.

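The rule for the next switching time can be sketched as follows, reading it as "hold the current strategy until the elapsed re-scaled time first exceeds the allotted duration". This reading, the function name and the sample durations are assumptions made for illustration; the re-scaled length of step i is taken to be 1/(i + 1).

```python
def next_switch_time(s_n, hold):
    """Smallest m > s_n with t(m) - t(s_n) > hold, where step i has length 1/(i + 1)."""
    m, elapsed = s_n, 0.0
    while elapsed <= hold:
        elapsed += 1.0 / (m + 1)
        m += 1
    return m


early = next_switch_time(0, 1.2)     # t(1) = 1.0 is too short, t(2) = 1.5 suffices
late = next_switch_time(100, 1.0)    # roughly 100 * e steps later in absolute time
```

The same re-scaled duration corresponds to few real steps early on and to many real steps later, which is how the holding periods grow over time.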


3 A two time scale result 

This section develops an elementary two time scale result needed for the 
proof of convergence. For the reader's convenience we break the proof into a 
series of smaller units. 

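Lemma 3. For every $x \in K \setminus D$,

\[ B(x, \rho(x)) \cap D = \emptyset. \]

Proof. If $y \in B(x, \rho(x)) \cap D$, then $y_D = y$ and so

\[ \inf_{\pi^a \in \Pi^a} \langle \kappa(\pi^p_x, \pi^a) - y_D,\; y_D - y \rangle = 0. \]

But this contradicts (1). Hence, $B(x, \rho(x)) \cap D = \emptyset$.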

□ 
Lemma 4. For any compact set $L$ such that $L \cap D = \emptyset$, we have

\[ \left| \{ q \in Q : B(q, \rho(q)/2) \cap L \neq \emptyset \} \right| < \infty. \]

Proof. Since both $L$ and $D$ are compact sets, it follows that

\[ \inf_{x \in D,\, y \in L} \|x - y\| =: d(L) > 0. \]

Consider any $q$ such that $q \in Q_m$ and $m > \frac{3}{2 d(L)}$. Since $Q_m \subset K_m$, we
have

\[ \inf_{x \in D} \|q - x\| \le \frac{1}{m} < \frac{2 d(L)}{3}. \]

Further, by Lemma 3, $\rho(q) \le 1/m < 2 d(L)/3$. It follows that if $m > \frac{3}{2 d(L)}$
and $q \in Q_m$ then $B(q, \rho(q)/2) \cap L = \emptyset$. Since each $Q_m$ is finite, the result follows.

□ 

Lemma 5. Let $(s_{m(n)})$ be an increasing subsequence of $S$. If $\lim_{n \to \infty} x_{s_{m(n)}} = x$
for some $x \in K \setminus D$, then along a further subsequence, denoted $(s_{m(n)})$
again, there exists $q \in Q$ such that $Q(x_{s_{m(n)}}) = q$ for all $n \in \mathbb{N}$.

Proof. Since $\lim_{n \to \infty} x_{s_{m(n)}} = x \notin D$, there exists a compact set $L$ such that
$L \cap D = \emptyset$ and $x_{s_{m(n)}} \in L$ for all sufficiently large $n$. By Lemma 4,

\[ \left| \{ q \in Q : B(q, \rho(q)/2) \cap L \neq \emptyset \} \right| < \infty. \]

Thus there exists $q \in Q$ such that along a subsequence, denoted $(s_{m(n)})$
again, we have $x_{s_{m(n)}} \in L$ and $Q(x_{s_{m(n)}}) = q$ for all $n \in \mathbb{N}$.

□ 

The Mannor-Tsitsiklis bound. We now introduce a set of conditions,
labeled (†), which is needed for Theorem 6 and Corollary 7 below. To this
end, let $(s_{m(n)})$ be an arbitrary increasing subsequence of $S$. Let $T_l$ and $T_r$
be times such that $T_l < T_r$. Let $(l_{m(n)})$ and $(r_{m(n)})$ be sequences such that
$s_{m(n)} \le l_{m(n)} < r_{m(n)} \le s_{m(n)+1}$, $n \in \mathbb{N}$. Let (†) denote the following four
conditions:

1† $x_{s_{m(n)}} \to x$ for some $x \in K \setminus D$.

2† $Q(x_{s_{m(n)}}) = q$ for some $q \in Q$ and all $n \in \mathbb{N}$.

3† $[T_l, T_r) \subset [0, T(q))$.

4† $t(l_{m(n)}) - t(s_{m(n)}) \to T_l$ and $t(r_{m(n)}) - t(s_{m(n)}) \to T_r$.

Assuming the conditions of (†) hold, for $l_{m(n)} \le j < r_{m(n)}$ consider the
single step reward $\kappa(\theta_j, u^p_j, u^a_j)$. At each of these time steps the player adopts
the strategy $\pi^p_q$ independently of the action chosen by the adversary. For
$\theta \in S$ and $u^a \in U^a$, let $\kappa_\tau(\theta, u^a)$ be the reward at the $\tau$-th occurrence of
$(\theta, u^a)$ in the range $l_{m(n)}, \ldots, r_{m(n)} - 1$. The rewards $\kappa_\tau(\theta, u^a)$, $\tau = 1, 2, \ldots$,
are independent, identically distributed random variables with mean

\[ E[\kappa_\tau(\theta, u^a)] = \sum_{u^p \in U^p} \kappa(\theta, u^p, u^a)\, \pi^p_q(u^p \mid \theta). \]

Further, since each $\kappa(\theta_j, u^p_j, u^a_j)$ is chosen from a compact set, we get, for $z$
in any neighbourhood of the origin,

\[ E[\exp(\langle z, \kappa_\tau(\theta, u^a) \rangle)] < \infty, \]

where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^d$. Define the set $R(\pi^p_q) := \{\kappa(\pi^p_q, \pi^a) :
\pi^a \in \Pi^a\}$. For a vector $v$, define $\|v - R(\pi^p_q)\| := \inf_{\pi^a \in \Pi^a} \|v - \kappa(\pi^p_q, \pi^a)\|$. We
can now invoke Theorem 6.2 of Mannor and Tsitsiklis [3]. For our setup and
with our notation, it reads as follows:

Theorem 6. Assuming that the conditions of (†) hold, there exists a function
$\Lambda : (0, \infty) \to (0, \infty]$ and a positive constant $c$, such that irrespective of the
adversary policy $\pi^a$, the following bound holds:

\[ P\left[ \left\| \frac{1}{r_{m(n)} - l_{m(n)}} \sum_{j=l_{m(n)}}^{r_{m(n)}-1} \kappa(\theta_j, u^p_j, u^a_j) - R(\pi^p_q) \right\| > \epsilon \right] \le c \exp\left(-\Lambda(\epsilon)(r_{m(n)} - l_{m(n)})\right). \]

For the next result, note that $t(r_{m(n)}) - t(l_{m(n)}) = \sum_{i=l_{m(n)}}^{r_{m(n)}-1} \frac{1}{i+1}$. Under
the conditions of (†) this implies that

\[ \lim_{n \to \infty} \frac{r_{m(n)}}{l_{m(n)}} = \exp(T_r - T_l). \tag{3} \]
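The limit of the ratio of the two indices can be checked numerically: a harmonic sum from index l to index r behaves like log(r/l), so a fixed re-scaled gap corresponds to a fixed ratio of indices in the limit. The sketch below is illustrative only; the function name and the chosen gap are hypothetical.

```python
import math


def ratio_for_gap(l, gap):
    """Find the first r > l with sum_{i=l}^{r-1} 1/(i+1) >= gap and return r / l."""
    r, total = l, 0.0
    while total < gap:
        total += 1.0 / (r + 1)
        r += 1
    return r / l


gap = 0.5
ratios = [ratio_for_gap(l, gap) for l in (10, 1000, 100000)]
limit = math.exp(gap)
```

The computed ratios approach exp(0.5) as the starting index l grows.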
Corollary 7. Assuming that the conditions of (†) hold, we have

\[ \lim_{n \to \infty} \left\| \frac{1}{r_{m(n)} - l_{m(n)}} \sum_{j=l_{m(n)}}^{r_{m(n)}-1} \kappa(\theta_j, u^p_j, u^a_j) - R(\pi^p_q) \right\| = 0 \quad \text{a.s.} \]

Proof. Since $\lim_{n \to \infty} (t(r_{m(n)}) - t(l_{m(n)})) = T_r - T_l > 0$, it follows from (3)
that $r_{m(n)} - l_{m(n)} > l_{m(n)} [\exp(T_r - T_l) - 1]/2$ for $n$ sufficiently large. Since
$l_{m(n)} \ge n$, we get $r_{m(n)} - l_{m(n)} > n [\exp(T_r - T_l) - 1]/2$ for $n$ sufficiently
large. Plugging this estimate in Theorem 6 and noting that the constant $\epsilon$
is arbitrary, a standard application of the Borel-Cantelli argument gives the
result. □

The two time scale result. With Corollary 7 available for use, we are
ready for our main two time scale result. Thus, let $(s_{m(n)})$ be an arbitrary
increasing subsequence of $S$. Assume that $\lim_{n \to \infty} x_{s_{m(n)}} = x$ for some
$x \in K \setminus D$. By Lemma 5 there exists a $q \in Q$ such that along a subsequence,
denoted again by $(s_{m(n)})$, $Q(x_{s_{m(n)}}) = q$ for all $n \in \mathbb{N}$. For $n \in \mathbb{N}$ and $t \ge 0$
define the trajectories

\[ y_{m(n)}(t) := \begin{cases} x(t(s_{m(n)}) + t) & \text{if } t \le t(s_{m(n)+1}) - t(s_{m(n)}), \\ x(t(s_{m(n)+1})) & \text{if } t > t(s_{m(n)+1}) - t(s_{m(n)}). \end{cases} \tag{4} \]

By the Arzela-Ascoli theorem there exists a continuous trajectory $y(\cdot)$ such
that along a subsequence, denoted again by $(m(n))$, $\lim_{n \to \infty} y_{m(n)}(\cdot) = y(\cdot)$
in the topology of uniform convergence over compacts.

Set $T = T(q)$. For $k \in \mathbb{N}$ and $J_k = \{0, 1, \ldots, 2^k - 1\}$ consider the finite
collection of intervals

\[ C_k := \left\{ [2^{-k} j T,\; 2^{-k}(j+1) T) : j \in J_k \right\}. \tag{5} \]

For $j \in J_k$, define

\[ \kappa_{j,k} := \frac{y(2^{-k}(j+1)T) - \exp(-2^{-k}T)\, y(2^{-k} j T)}{1 - \exp(-2^{-k}T)}. \]

Next, with $(s_{m(n)})$ denoting an arbitrary increasing subsequence of $S$, define
$N(q, j, k)$ as the following set:

\[ N(q, j, k) := \left\{ (x_n) : \exists (s_{m(n)}) \text{ s.t. } Q(x_{s_{m(n)}}) = q \;\; \forall n \text{ and } \kappa_{j,k} \notin R(\pi^p_q) \right\}. \]

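Proposition 8. The set $N(q, j, k)$ is a null set, i.e., $P[N(q, j, k)] = 0$.

Proof. Fix any interval $[2^{-k} j T, 2^{-k}(j+1)T)$ in $C_k$. Let $(l_{m(n)})$ and $(r_{m(n)})$
be sequences with $s_{m(n)} \le l_{m(n)} < r_{m(n)} \le s_{m(n)+1}$, $n \in \mathbb{N}$, such that
$t(l_{m(n)}) - t(s_{m(n)}) \to 2^{-k} j T$ and $t(r_{m(n)}) - t(s_{m(n)}) \to 2^{-k}(j+1)T$. We have

\[ \frac{y(2^{-k}(j+1)T) - y(2^{-k} j T)}{2^{-k} T} = \lim_{n \to \infty} \frac{x_{r_{m(n)}} - x_{l_{m(n)}}}{t(r_{m(n)}) - t(l_{m(n)})}. \]

In terms of $l_{m(n)}$ and $r_{m(n)}$, the equation for average reward can be written
as

\[ x_{r_{m(n)}} = \frac{l_{m(n)}}{r_{m(n)}}\, x_{l_{m(n)}} + \frac{1}{r_{m(n)}} \sum_{i=l_{m(n)}}^{r_{m(n)}-1} \kappa(\theta_i, u^p_i, u^a_i). \]

Rearranging, we get

\[ \frac{x_{r_{m(n)}} - (l_{m(n)}/r_{m(n)})\, x_{l_{m(n)}}}{1 - (l_{m(n)}/r_{m(n)})} = \frac{1}{r_{m(n)} - l_{m(n)}} \sum_{i=l_{m(n)}}^{r_{m(n)}-1} \kappa(\theta_i, u^p_i, u^a_i). \]

Since $x_{r_{m(n)}} \to y(2^{-k}(j+1)T)$ and $x_{l_{m(n)}} \to y(2^{-k} j T)$, it follows from (3)
that

\[ \lim_{n \to \infty} \frac{x_{r_{m(n)}} - (l_{m(n)}/r_{m(n)})\, x_{l_{m(n)}}}{1 - (l_{m(n)}/r_{m(n)})} = \kappa_{j,k}, \]

and consequently

\[ \lim_{n \to \infty} \frac{1}{r_{m(n)} - l_{m(n)}} \sum_{i=l_{m(n)}}^{r_{m(n)}-1} \kappa(\theta_i, u^p_i, u^a_i) = \kappa_{j,k}. \]

Hence, by Corollary 7 it must be the case that

\[ \kappa_{j,k} \in R(\pi^p_q) \quad \text{almost surely.} \]

□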



Define $C := \bigcup_k C_k$. The next fact is crucial to our analysis.

Proposition 9. The collection $C$ is a countable collection of intervals.

Define $N$ to be the following set:

\[ N := \bigcup_{q \in Q} \bigcup_{k \in \mathbb{N}} \bigcup_{j \in J_k} N(q, j, k). \]

Proposition 10. The event $N$ is a null set, i.e., $P[N] = 0$.

Proof. Both Q and C are countable collections. The result now follows from 
the fact that the union of countably many exceptional null sets is again a 
null set. 

□ 
By virtue of Proposition 10, to show almost sure convergence of sequences
$(x_n)$ to $D$ it suffices to restrict attention to sequences outside $N$.
Consequently, in what follows we shall work exclusively with sequences $(x_n)$ outside
the exceptional null set $N$.

Theorem 11. Let $(x_n)$ be any sequence outside the exceptional null set $N$.
For $(s_{m(n)})$ an increasing subsequence of $S$, assume that $\lim_{n \to \infty} x_{s_{m(n)}} = x$
for some $x \in K \setminus D$. Assume further that for some $q \in Q$, $Q(x_{s_{m(n)}}) = q$ for
all $n \in \mathbb{N}$. Let $T = T(q)$. Let $y(\cdot)$ be a limiting trajectory of the trajectories
$y_{m(n)}(\cdot)$ given by (4). Then, for $t \in [0, T]$, $y(t)$ can be written as

\[ y(t) = y(0) + \int_0^t v(s)\, ds, \tag{6} \]

where $v(\cdot)$ is a Borel measurable function defined on $[0, T]$. Further, for
Lebesgue almost all $t$ in $[0, T]$, the following holds:

\[ v(t) + y(t) \in R(\pi^p_q). \tag{7} \]

Remark. We point out that (6) is a standard result in two time scale
theory. Moreover, using Lebesgue's theorem we could also show (7) to hold
almost surely for any (but not all) $t \in [0, T]$. The problem arises from the
fact that the set $[0, T]$ is an uncountable set and when we take a union of null
sets, one for each $t \in [0, T]$, the union need not be a null set. We solve this
problem by treating the interval $[0, T)$ as a probability space and giving the
trajectory $y(t)$ a martingale structure. This also provides an independent
and elementary proof of the two time scale structure.

Proof. Define $\mathcal{G}_k := \sigma(C_k)$, the $\sigma$-algebra on $[0, T)$ generated by $C_k$. Let
$\mathcal{G} := \bigvee_k \mathcal{G}_k$. For $\lambda$ the Lebesgue measure, define the scaled probability
measure $\mu$ on $[0, T)$ given by $d\mu/d\lambda = 1/T$. This acts as a probability
measure for the probability space $([0, T), \mu, \mathcal{G})$. For $t \in [0, T)$ and $k \in \mathbb{N}$
define the 'floor' $f_k(t) := 2^{-k} \lfloor 2^k t / T \rfloor T$. Thus, for any $t \in [0, T)$, we have
$t \in [f_k(t), f_k(t) + 2^{-k} T)$. Define $M_k(t)$ as:

\[ M_k(t) := \frac{y(f_k(t) + 2^{-k} T) - y(f_k(t))}{2^{-k} T}. \]

Note that $M_k(\cdot)$ is $\mathcal{G}_k$-measurable. Further, for $t \in [0, T)$ we have

\[ E[M_{k+1}(t) \mid \mathcal{G}_k] = M_k(t) \quad \mu\text{-almost surely.} \]

In other words, the sequence $(M_k(\cdot))_{k \in \mathbb{N}}$ forms a bounded martingale in the
filtered probability space $([0, T), \mu, \mathcal{G}, \mathcal{G}_k)$. It follows that $\mu$-almost surely
the limit $v(t) := \lim_{k \to \infty} M_k(t)$ exists. The limit $v(\cdot)$ is, clearly, a measurable
function. Note that $[0, f_k(t))$ is a $\mathcal{G}_k$-measurable subset of $[0, T)$. Letting
$A := [0, f_k(t))$, it is immediate that $\int_A M_k(s)\, ds = \int_A v(s)\, ds$. It follows that

\[ y(f_k(t)) - y(0) = \int_0^{f_k(t)} M_k(s)\, ds = \int_0^{f_k(t)} v(s)\, ds. \]

Letting $k \to \infty$ gives us:

\[ y(t) = y(0) + \int_0^t v(s)\, ds. \]


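The martingale property of the dyadic difference quotients can be illustrated numerically: the slope over a dyadic interval is the average of the slopes over its two halves. The sketch below is only an illustration; the test function `y` and the function name are hypothetical, and the trajectory is taken to be scalar.

```python
def dyadic_quotient(y, T, k, t):
    """Slope of y over the level-k dyadic interval of [0, T) that contains t."""
    h = T / 2 ** k
    left = int(t / h) * h          # the 'floor' of t on the level-k grid
    return (y(left + h) - y(left)) / h


y = lambda s: s ** 2               # hypothetical smooth scalar trajectory
T, k = 1.0, 2

coarse = dyadic_quotient(y, T, k, 0.3)           # interval [0.25, 0.5)
left_half = dyadic_quotient(y, T, k + 1, 0.26)   # interval [0.25, 0.375)
right_half = dyadic_quotient(y, T, k + 1, 0.40)  # interval [0.375, 0.5)
average_of_halves = 0.5 * (left_half + right_half)
```

Both halves carry equal weight under the normalized Lebesgue measure, so the conditional expectation of the finer quotient given the coarser partition is exactly the coarser quotient.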

Let $t \in [0, T)$. Set $j = j(k) = \lfloor 2^k t / T \rfloor$. Note that as $t$ ranges over $[0, T)$,
the pair $(j(k), k)$ still takes values in a countable set. From the definitions of
$M_k(t)$ and $\kappa_{j,k}$ it follows that

\[ v(t) = \lim_{k \to \infty} M_k(t) = \lim_{k \to \infty} \kappa_{j(k),k} - y(t). \]

Since $(x_n)$ is outside the exceptional null set $N$, $\lim_{k \to \infty} \kappa_{j(k),k}$ must
necessarily lie in $R(\pi^p_q)$.

□ 

Lemma 12. Let $(x_n)$ be any sequence outside the exceptional null set $N$.
For $(s_{m(n)})$ an increasing subsequence of $S$, assume that $\lim_{n \to \infty} x_{s_{m(n)}} = x$
for some $x \in K \setminus D$. Assume further that for some $q \in Q$, $Q(x_{s_{m(n)}}) = q$ for
all $n \in \mathbb{N}$. Let $T = T(q)$. Let $y(\cdot)$ be a limiting trajectory of the trajectories
$y_{m(n)}(\cdot)$ given by (4). Then

\[ \inf_{w \in D} \|y(t) - w\| \le \inf_{w \in D} \|y(0) - w\| \exp(-t) \quad \text{for all } t \in [0, T(q)]. \]

Proof. For $t \in [0, T)$ let $d(t) := \inf_{w \in D} \|y(t) - w\|$. For any point $p$, let
$d_p(t) := \|y(t) - p\|$. Let $y_D(t)$ be the point in $D$ closest to $y(t)$. We have

\[ \dot{d}(t) \le \dot{d}_p(t)\big|_{p = y_D(t)} = \frac{\langle v(t),\; y(t) - y_D(t) \rangle}{\|y(t) - y_D(t)\|} \le -d(t), \]

where the last inequality uses (1) and (7), and the result follows. □
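The exponential decay of the distance can be illustrated on a toy example. The sketch below makes several assumptions that are not in the text: D is the half-space of points with non-positive first coordinate, the limiting dynamics are driven by a constant target inside D, so that the first coordinate obeys y1' = target1 - y1, and a simple Euler scheme stands in for the exact trajectory. All names are hypothetical.

```python
import math


def simulate_distance(y1_0, target1, horizon, steps=1000):
    """Euler scheme for y1' = target1 - y1; returns the final distance to {w : w1 <= 0}."""
    h = horizon / steps
    y1 = y1_0
    for _ in range(steps):
        y1 += h * (target1 - y1)
    return max(y1, 0.0)


d0, horizon = 2.0, 0.5
d_end = simulate_distance(d0, target1=-1.0, horizon=horizon)
bound = d0 * math.exp(-horizon)    # the bound d(0) * exp(-t) at t = horizon
```

Here the exact first coordinate is -1 + 3 exp(-t), which stays below the bound 2 exp(-t) for all t, in line with the differential inequality for the distance.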
4 Almost sure convergence

As before we present our proof as a series of short lemmas. 

Lemma 13. For $s_n \in S$, if $x_{s_n} \in D$ then

\[ t(s_{n+1}) - t(s_n) = 1/(s_n + 1) < 1/s_n, \]

while if $x_{s_n} \in K \setminus D$ then

\[ t(s_{n+1}) - t(s_n) < T(Q(x_{s_n})) + 1/s_n. \]

Lemma 14. For $x \in K \setminus D$ we have

\[ T(Q(x)) \to 0 \quad \text{as } x \to D. \]

Proof. By definition, $x \in B(Q(x), \rho(Q(x))/2)$. It follows from Lemma 3 that
$B(Q(x), \rho(Q(x))) \cap D = \emptyset$. Consequently $\rho(Q(x)) < 2 \inf_{y \in D} \|x - y\|$. The
result now follows from (2). □

Lemma 15. Let $(s_{m(n)})$ be an increasing subsequence of $S$. If $\lim_{n \to \infty} x_{s_{m(n)}} = x$
for some $x \in D$ then

\[ t(s_{m(n)+1}) - t(s_{m(n)}) \to 0. \]

Proof. By Lemma 13, if $x_{s_{m(n)}} \in D$ then $t(s_{m(n)+1}) - t(s_{m(n)}) < 1/s_{m(n)} \le 1/n$,
while if $x_{s_{m(n)}} \in K \setminus D$ then $t(s_{m(n)+1}) - t(s_{m(n)}) < T(Q(x_{s_{m(n)}})) + 1/s_{m(n)}$.
By Lemma 14, $T(Q(x_{s_{m(n)}})) \to 0$ as $n \to \infty$. Since $s_{m(n)} \to \infty$ as $n \to \infty$,
the result follows. □

Lemma 16. Let $(s_{m(n)})$ be an increasing subsequence of $S$ such that
$\lim_{n \to \infty} x_{s_{m(n)}} = x$ and $\lim_{n \to \infty} x_{s_{m(n)+1}} = y$. If $y \in K \setminus D$ then $x \in K \setminus D$.

Proof. Assume $x \in D$. Since $\|x_{s_{m(n)+1}} - x_{s_{m(n)}}\| \le v_{\max} (t(s_{m(n)+1}) - t(s_{m(n)}))$,
it follows from Lemma 15 that

\[ \lim_{n \to \infty} \|x_{s_{m(n)+1}} - x_{s_{m(n)}}\| = 0. \]

This leads to a contradiction since $y \in K \setminus D$. □

Recall that $S = (s_n)$ is the increasing sequence of times when the player
changes his strategy.
Lemma 17. Let $(x_n)$ be a sequence outside the exceptional null set $N$. If $y$
is a limit point of the sequence $(x_{s_n})$ then $y \in D$.

Proof. Assume to the contrary and let $y$ be a limit point of $(x_{s_n})$ that is
farthest from $D$. Take an appropriate subsequence such that $\lim_{n \to \infty} x_{s_{m(n)}} = x$
and $\lim_{n \to \infty} x_{s_{m(n)+1}} = y$. By Lemma 16, $x \in K \setminus D$. Further assume, by
Lemma 5, that the subsequence is such that $Q(x_{s_{m(n)}}) = q$ for some $q \in Q$
and all $n \in \mathbb{N}$. From our choice of $y$ it follows that

\[ \inf_{w \in D} \|y - w\| \ge \inf_{w \in D} \|x - w\|. \]

But by Lemma 12 we get

\[ \inf_{w \in D} \|y - w\| \le \inf_{w \in D} \|x - w\| \exp(-T(q)). \]

Since $T(q) > 0$ this leads to a contradiction and the result follows. □

Theorem 18. Let $(x_n)$ be a sequence outside the exceptional null set $N$. If
$x$ is a limit point of the sequence $(x_n)$ then $x \in D$.

Proof. By taking suitable subsequences assume that $\lim_{n \to \infty} x_{u_{m(n)}} = x$ where
$s_{m(n)} \le u_{m(n)} < s_{m(n)+1}$ for all $n \in \mathbb{N}$ with $(s_{m(n)})$ some increasing
subsequence of $S$. Assume further that $\lim_{n \to \infty} x_{s_{m(n)}} = y$ for some $y$. By
Lemma 17, $y \in D$. Since $\|x_{u_{m(n)}} - x_{s_{m(n)}}\| \le v_{\max} (t(s_{m(n)+1}) - t(s_{m(n)}))$, it
follows from Lemma 15 that $\lim_{n \to \infty} \|x_{u_{m(n)}} - x_{s_{m(n)}}\| = 0$. Thus $x = y$ and
the result follows.

□ 

5 Conclusion 

We have established the a.s. convergence of our scheme to the desired limit 
set for finite state controlled Markov chains. In conclusion we point out some 
future directions. 

Extension to non-convex D. For non-convex D in general, the existence 
of a 'nearest point' in D from any point outside D is guaranteed. A scheme 
along above lines can be conceived wherein one uses piecewise constant poli- 
cies that ensure decrease of distance from D if such policies are known to 
exist. 






Countable state space. Under suitable uniform stability assumption or 
'near-monotonicity' condition on costs, variations of the above scheme can 
be proposed for Blackwell approachability. This will be pursued in a future 
work. 

Computational issues. The above scheme is an 'ideal' scheme in so far as 
it ignores actual computational aspects. A practical implementation would 
raise further issues such as recursive on-line computation of policies, learning, 
etc. 

A combination scheme. A variation that seems promising is to combine 
the approaches of this paper and Shimkin and Shwartz [4], switching
strategies when the currently adopted strategy exhausts its allotted time, or when
the chain returns to a prescribed state, whichever occurs first. One expects 
similar results, though the analysis will be messier. 